In this document, we conduct a systematic review of the literature on the short-term health effects of air pollution. The objective is twofold:

- retrieve effect sizes and confidence intervals in order to compute power, type M, and type S error in this literature;
- get a sense of the proportion of papers in this literature that discuss power and missing data issues.
To carry out this analysis, we use the `fulltext` package to retrieve the abstract and full text of each article matching our search query. This set of articles might be too restrictive and we may want to broaden our scope later.
To begin with, we focus on articles indexed in Scopus and PubMed. To access the Scopus API, one needs to register for an Elsevier API key (stored in the `.Renviron`) and a Crossref TDM API key. Note that downloading full texts may not work if one is not connected directly to their institution's network. PubMed articles are accessed via Entrez. An API key increases the number of allowed requests per second from 3 to 10. More information on authentication is available in the `fulltext` manual.
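For instance, a minimal sketch of how these keys could be stored and checked; the environment variable names below are assumptions and should be verified against the `fulltext` and `rentrez` manuals:

```r
# Hypothetical .Renviron entries (variable names to be checked
# against the fulltext and rentrez manuals):
# ELSEVIER_SCOPUS_KEY=your-scopus-key
# CROSSREF_TDM=your-crossref-tdm-key
# ENTREZ_KEY=your-entrez-key

# Check from R that the keys are visible to the current session
Sys.getenv(c("ELSEVIER_SCOPUS_KEY", "CROSSREF_TDM", "ENTREZ_KEY"))
```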
First of all, we need to clearly define the set of articles we want to consider in this analysis. Our baseline search query is: "air pollution" AND ("emergency" OR "mortality") AND ("particulate matter" OR ozone OR "nitrogen dioxide" OR "sulfur dioxide"). This returns about 57,000 articles (without accounting for duplicates between journals). Since this number is rather large, we want to narrow it down. To focus on the epidemiological literature, we add the query term poisson. To get causal analyses, we add the term causal.
# Set folder to store downloads
cache_options_set(full_path = "/Users/vincentbagilet/Documents/fulltext")
# cache_options_set(full_path = "/Volumes/VincentSSD/Research_data/air_pollution/lit_review/full_texts")
query <- 'TITLE(("air pollution" OR "air quality" OR "particulate matter" OR ozone OR "nitrogen dioxide" OR "sulfur dioxide") AND ("emergency" OR "mortality") AND NOT ("long term" OR "long-term")) AND ("particulate matter" OR ozone OR "nitrogen dioxide" OR "sulfur dioxide")' # AND (poisson OR causal)
opts_entrez <- list(use_history = TRUE)
We then run the search:
# search <- ft_search(query, from = c("scopus", "entrez"), limit = 50)
search <- ft_search(query, from = "scopus", limit = 2000)
search_entrez <- ft_search(query, from = "entrez", limit = 600, entrezopts = opts_entrez)

Then, we retrieve the associated metadata. Since the metadata from the two sources have different shapes, we only select a few relevant columns to build an overall metadata set.
metadata_scopus <- search$scopus$data %>%
  as_tibble() %>%
  rename_all(~ str_remove_all(., "prism:|dc:")) %>%
  rename_all(~ str_replace_all(., "-", "_")) %>%
  select(doi, title, creator, publicationName, pubmed_id, coverDate) %>%
  rename(
    authors = creator,
    journal = publicationName
  ) %>%
  mutate(
    pubmed_id = ifelse(!str_detect(pubmed_id, "[0-9]{7}"), NA, pubmed_id),
    pub_date = ymd(coverDate)
  ) %>%
  select(-coverDate)
metadata_entrez <- search_entrez$entrez$data %>%
  as_tibble() %>%
  select(doi, title, authors, fulljournalname, pmid, pubdate) %>%
  rename(
    journal = fulljournalname,
    pubmed_id = pmid
  ) %>%
  mutate(
    pubmed_id = ifelse(!str_detect(pubmed_id, "[0-9]{7}"), NA, pubmed_id),
    pub_date = ymd(pubdate)
  ) %>%
  select(-pubdate)
metadata_lit_review <- metadata_scopus %>%
  bind_rows(metadata_entrez) %>%
  filter(!is.na(doi))

saveRDS(metadata_lit_review, "../Outputs/metadata_lit_review.RDS")
The full list of articles studied here is as follows:
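A sketch of one way to display this list; the column choice here is illustrative:

```r
# Display the retrieved articles, most recent first
metadata_lit_review %>%
  select(title, authors, journal, pub_date) %>%
  arrange(desc(pub_date)) %>%
  kable(col.names = c("Title", "Authors", "Journal", "Publication date"))
```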
First, we focus on the abstracts and download them.

There is no `fulltext` function to access abstracts from Entrez. Therefore, using the DOIs, we get the abstracts from Semantic Scholar. We also access the Scopus abstracts from Semantic Scholar since, my IP address being located outside of Columbia's network, I cannot access the texts and abstracts from Scopus directly.

Semantic Scholar enforces a rate limit of 100 articles per 5 minutes, i.e. 20 articles per minute. We therefore need to pause execution between batches to be able to download everything. In addition, some DOIs cause errors, so we use tryCatch to record where these errors occur and handle them later.
get_abstracts <- function(doi) {
  vect_doi <- unique(doi)
  number_periods <- (length(vect_doi) - 1) %/% 20
  abs <- NULL
  message(str_c("Total downloading time: ", number_periods, " min"))
  
  for (i in 0:number_periods) {
    # Batch of at most 20 DOIs, without running past the end of the vector
    doi_period <- vect_doi[(20*i + 1):min(20*(i + 1), length(vect_doi))]
    doi_period <- doi_period[!is.na(doi_period)]
    
    # Handle problematic DOIs using tryCatch
    possible_error <- tryCatch(
      abs_period <- doi_period %>%
        ft_abstract(from = "semanticscholar") %>%
        .$semanticscholar %>%
        as_tibble() %>%
        unnest(cols = everything()) %>%
        pivot_longer(everything(), names_to = "doi", values_to = "abstract") %>%
        filter(doi != abstract),
      error = function(e) e
    )
    
    if (inherits(possible_error, "error")) {
      print(i) # record the index of the failing batch
      next
    } else {
      abs <- abs %>%
        rbind(abs_period)
    }
    
    if (i < number_periods & number_periods != 0) {
      message(str_c("Remaining time: ", (number_periods - i), " min"))
      Sys.sleep(63)
    }
  }
  
  return(abs)
}
abstracts_t <- metadata_lit_review$doi %>%
  get_abstracts() %>%
  left_join(metadata_lit_review, by = "doi") %>%
  distinct(title, .keep_all = TRUE)

# saveRDS(abstracts_t, "../Outputs/abstracts.RDS")
Based on the abstracts, we can briefly explore the main themes of the articles.
abstracts <- readRDS("../Outputs/abstracts.RDS")
abstracts %>%
  unnest_tokens(word, abstract, to_lower = TRUE) %>%
  anti_join(tidytext::stop_words, by = "word") %>%
  count(word) %>%
  with(wordcloud::wordcloud(word, n, max.words = 100, random.color = TRUE))
Then, we want to extract the effects, odds ratios, and associated confidence intervals from the abstracts. To identify the effects and CIs, we proceed as follows:
effects_CI <- abstracts %>%
  mutate(
    abstract = str_replace_all(abstract, "·", ".")
  ) %>%
  select(doi, abstract) %>%
  unnest_tokens(sentence, abstract, token = "sentences", to_lower = FALSE, drop = FALSE) %>%
  mutate(
    contains_CI = str_detect(sentence, "(95%|\\bCI\\b|\\b(c|C)onfidence (i|I)nterval\\b)")
  ) %>%
  filter(contains_CI) %>%
  mutate(
    CI = str_extract_all(sentence, "((?<=(95%|\\bCI\\b|\\b(c|C)onfidence (i|I)nterval\\b)[^\\d\\.]{0,4})[-−]?[\\d\\.]{1,7}[–\\s:;,%-to]{1,4}[-−]?[\\d\\.]{1,5})|(?<=(\\(|\\[))[\\d-]{1,4}\\.?\\d{0,5}[–\\s:;,%-to]{1,4}[\\d-]{1,4}\\.?\\d{0,5}(?=[\\)\\];])"),
    effect = str_extract_all(sentence, "[-−]?[\\d\\.]{1,7}(?=[^\\d.]{0,7}([^\\.\\d]\\b95%|\\bCI|\\b(c|C)onfidence (i|I)nterval\\b))(?<!\\b95)|[-−]?[\\d\\.]{1,7}(?=%?\\s?(\\(|\\[)[\\d-]{1,4}\\.?\\d{0,5}[–\\s:;,%-to]{1,4}[\\d-]{1,4}\\.?\\d{0,5}[\\)\\];])(?<![^\\.\\d]95)")
  )
# In the effect and CI regexes, could replace [^\\d.] by [\\s:;,\\(\\[%] to be more precise
Here are examples of the confidence intervals detected using this method:
random_sentences <- sample(1:length(effects_CI$sentence), 5)
str_view_all(effects_CI$sentence[random_sentences], "((?<=(95%|\\bCI\\b|\\b(c|C)onfidence (i|I)nterval\\b)[^\\d\\.]{0,4})[-−]?[\\d\\.]{1,7}[–\\s:;,%-to]{1,4}[-−]?[\\d\\.]{1,5})|(?<=(\\(|\\[))[\\d-]{1,4}\\.?\\d{0,5}[–\\s:;,%-to]{1,4}[\\d-]{1,4}\\.?\\d{0,5}(?=[\\)\\];])")
The corresponding detected effects are:
str_view_all(effects_CI$sentence[random_sentences], "[-−]?[\\d\\.]{1,7}(?=[^\\d.]{0,7}([^\\.\\d]\\b95%|\\bCI|\\b(c|C)onfidence (i|I)nterval\\b))(?<!\\b95)|[-−]?[\\d\\.]{1,7}(?=%?\\s?(\\(|\\[)[\\d-]{1,4}\\.?\\d{0,5}[–\\s:;,%-to]{1,4}[\\d-]{1,4}\\.?\\d{0,5}[\\)\\];])(?<![^\\.\\d]95)")
Once the effects and CIs are identified, some wrangling is necessary to get the data into a usable format.
effects_CI_clean <- effects_CI %>%
  filter(lengths(effect) == lengths(CI)) %>%
  unnest(c(effect, CI), keep_empty = TRUE) %>%
  separate(CI, into = c("low_CI", "up_CI"), "([\\s,]+)|(?<!^)[-–]") %>%
  mutate(across(c("effect", "low_CI", "up_CI"), .fns = as.numeric)) %>%
  mutate(
    low_CI = ifelse(is.na(up_CI), NA, low_CI),
    up_CI = ifelse(is.na(low_CI), NA, up_CI),
    effect = ifelse(is.na(low_CI), NA, effect),
    pb_or_not_significant = (effect <= low_CI | effect >= up_CI)
  ) %>%
  select(-sentence, -abstract) %>%
  filter(!is.na(effect))
articles_effect <- abstracts %>%
  left_join(effects_CI_clean, by = "doi")

articles_effect %>%
  group_by(doi) %>%
  summarise(has_effect = mean(effect, na.rm = TRUE)) %>%
  count(ret = !is.nan(has_effect)) %>%
  mutate(ret = ifelse(ret, "Yes", "No")) %>%
  kable(col.names = c("Effect retrieved", "Number of articles"))
| Effect retrieved | Number of articles |
|---|---|
| No | 972 |
| Yes | 330 |
It might also be interesting to retrieve information about the study period, the number of observations, or p-values. We also use regular expressions to recover this information.
month_regex <- "\\b(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|(Nov|Dec)(?:ember)?)"
date_regex <- str_c("(",month_regex, "\\s){0,1}(19|20)\\d{2}")
abstracts_more_info <- abstracts %>%
  mutate(
    abstract = str_replace_all(abstract, "·", ".")
  ) %>%
  select(doi, abstract) %>%
  mutate(
    dates_obs = str_extract_all(abstract, str_c(date_regex, "( to |\\s?—\\s?|\\s?-\\s?| and )", date_regex))
  ) %>%
  unnest(dates_obs, keep_empty = TRUE) %>%
  separate(dates_obs, into = c("begin_obs", "end_obs"), "( to |\\s?—\\s?|\\s?-\\s?| and )") %>%
  mutate(
    p_value = str_extract_all(abstract, "(?<=\\b(p|P|p-value|p value)\\s?)[=<]\\s?[\\d.]+"),
    n_obs = str_extract_all(abstract, "(?<=(\\bn\\s?= ))[\\d,]{1,10}|[\\d,]{1,8}(?=(\\sobservations))")
  )
In this section, we implement robustness tests in order to compute the power, type M, and type S error in the studied articles. To do so, we use the `retrodesign` package and suppose that the true effect is a fraction of the measured effect. To use the `retro_design()` function, we first need to compute the standard error of each estimate. Probably due to rounding, we often do not get the same value for the standard error when computing it from the upper versus the lower bound of the CI. We therefore average the two values obtained.
articles_retro <- articles_effect %>%
  mutate(
    se_up = (up_CI - effect)/1.96,
    se_low = (low_CI - effect)/(-1.96),
    se = (se_up + se_low)/2
  ) %>%
  select(-se_low, -se_up) %>%
  filter(!is.na(effect)) %>%
  mutate(
    # retro_0.01 = as_tibble(retro_design(effect*0.01, se)),
    # retro_0.05 = as_tibble(retro_design(effect*0.05, se)),
    # retro_0.1 = as_tibble(retro_design(effect*0.1, se)),
    retro_0.5 = as_tibble(retro_design(effect*0.5, se)),
    retro_0.67 = as_tibble(retro_design(effect*0.67, se)),
    retro_0.75 = as_tibble(retro_design(effect*0.75, se))
    # retro_0.9 = as_tibble(retro_design(effect*0.9, se))
  ) %>%
  pivot_longer(
    starts_with("retro"),
    names_to = "prop_true_effect",
    values_to = "computed"
  ) %>%
  mutate(
    power = computed$power,
    typeS = computed$typeS,
    typeM = computed$typeM,
    prop_true_effect = as.numeric(str_remove(prop_true_effect, "retro_"))
  ) %>%
  select(-computed)
Then, we quickly explore the results. First, we look at the distribution of power, type M, and type S error across simulations and for different sizes of the true effect.
articles_retro %>%
  ggplot(aes(x = power)) +
  geom_histogram(bins = 10) +
  facet_wrap(~ prop_true_effect) +
  labs(
    title = "Distribution of power in retrodesign simulations",
    subtitle = "If the true effect was a fraction of the measured effect"
  )

articles_retro %>%
  ggplot(aes(x = typeM)) +
  geom_histogram(bins = 10) +
  facet_wrap(~ prop_true_effect) +
  labs(
    title = "Distribution of type M error in retrodesign simulations",
    subtitle = "If the true effect was a fraction of the measured effect"
  ) +
  scale_x_continuous(trans = 'log10')

articles_retro %>%
  ggplot(aes(x = typeS)) +
  geom_histogram(bins = 10) +
  facet_wrap(~ prop_true_effect) +
  labs(
    title = "Distribution of type S error in retrodesign simulations",
    subtitle = "If the true effect was a fraction of the measured effect"
  )
Then, we look at the relation between power, type M and type S error and true effect size.
articles_retro %>%
  ggplot() +
  geom_point(aes(x = power, y = typeM)) +
  facet_wrap(~ prop_true_effect) +
  labs(
    title = "Link between type M and power in retrodesign simulations",
    subtitle = "If the true effect was a fraction of the measured effect",
    x = "Power", y = "Type M"
  )

articles_retro %>%
  ggplot() +
  geom_point(aes(x = power, y = typeS)) +
  facet_wrap(~ prop_true_effect) +
  labs(
    title = "Link between type S and power in retrodesign simulations",
    subtitle = "If the true effect was a fraction of the measured effect",
    x = "Power", y = "Type S"
  )

articles_retro %>%
  filter(typeM < Inf) %>%
  group_by(prop_true_effect) %>%
  summarise(typeM = mean(typeM, na.rm = TRUE)) %>%
  ggplot() +
  geom_point(aes(x = prop_true_effect, y = typeM)) +
  labs(
    title = "Link between type M and 'true effect' in retrodesign simulations",
    x = "True effect as a proportion of the measured effect", y = "Type M"
  )

articles_retro %>%
  group_by(prop_true_effect) %>%
  summarise(typeS = mean(typeS, na.rm = TRUE)) %>%
  ggplot() +
  geom_point(aes(x = prop_true_effect, y = typeS)) +
  labs(
    title = "Link between type S and 'true effect' in retrodesign simulations",
    x = "True effect as a proportion of the measured effect", y = "Type S"
  )
We can also look at how power, type M and type S error evolved with publication date.
articles_retro %>%
  group_by(year = year(pub_date), prop_true_effect) %>%
  summarise(power = mean(power, na.rm = TRUE), .groups = "drop") %>%
  ggplot() +
  geom_point(aes(x = year, y = power)) +
  facet_wrap(~ prop_true_effect) +
  labs(title = "Evolution of power with publication date", x = "Publication date", y = "Power")

articles_retro %>%
  group_by(year = year(pub_date), prop_true_effect) %>%
  summarise(typeM = mean(typeM, na.rm = TRUE), .groups = "drop") %>%
  ggplot() +
  geom_point(aes(x = year, y = typeM)) +
  facet_wrap(~ prop_true_effect) +
  labs(title = "Evolution of type M with publication date", x = "Publication date", y = "Type M")

articles_retro %>%
  group_by(year = year(pub_date), prop_true_effect) %>%
  summarise(typeS = mean(typeS, na.rm = TRUE), .groups = "drop") %>%
  ggplot() +
  geom_point(aes(x = year, y = typeS)) +
  facet_wrap(~ prop_true_effect) +
  labs(title = "Evolution of type S with publication date", x = "Publication date", y = "Type S")
We then download the full texts (using the `ft_get()` function). The full texts are stored in XML format in the cache directory. Note that, my IP address being located outside of Columbia's network, I cannot access the texts from Scopus.
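A minimal sketch of the download step; the `from` argument and the object name `full_texts_raw` are assumptions to adapt:

```r
# Download full texts for all DOIs in the metadata set;
# files are written to the cache directory set above
full_texts_raw <- metadata_lit_review$doi %>%
  ft_get(from = "entrez")
```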
Once all the files are downloaded, we can put them into a table format before analyzing them.
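One possible sketch of this step, assuming the `pubchunks` package for parsing the cached XML files and assuming `full_texts_raw` holds the output of `ft_get()` (a hypothetical name); the section names passed to `pub_chunks()` should be checked against its manual:

```r
# Read the cached XML files into memory, extract a few fields,
# and stack the per-article results into one table
full_texts <- full_texts_raw %>%
  ft_collect() %>%
  pubchunks::pub_chunks(c("doi", "title", "body")) %>%
  pubchunks::pub_tabularize() %>%
  dplyr::bind_rows()
```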
To analyse the texts, we start by simply exploring the proportion of articles mentioning the words “missing” and “power”.
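A sketch of this computation, assuming a tibble `full_texts` with one row per article and a `text` column (both names are hypothetical):

```r
# Proportion of articles whose full text mentions "missing" or "power"
full_texts %>%
  mutate(
    mentions_missing = str_detect(str_to_lower(text), "\\bmissing\\b"),
    mentions_power = str_detect(str_to_lower(text), "\\bpower\\b")
  ) %>%
  summarise(
    prop_missing = mean(mentions_missing, na.rm = TRUE),
    prop_power = mean(mentions_power, na.rm = TRUE)
  )
```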